Attempts to understand factors that influence raters' ability to provide
accurate ratings have, on occasion, focused on static characteristics of ratees
(e.g., race and gender). The present study investigated dynamic characteristics
related to the pattern and level of performance displayed by the ratee. In a
repeated measures design, 37 raters were required to rate the performance of 4
subordinates who differed from each other in the level of their performance and
its consistency. It was hypothesized that performance characteristics would have
differential impact on several types of performance accuracy measures. Cronbach's
overall accuracy as well as two indices of accuracy based on signal detection
theory, labeled classification and behavioral accuracy by Lord (1985), were
compared. Results showed that characteristics of ratee performance did affect
the accuracy measures differentially.

The Impact of Performance Consistency and Performance
Level on Alternative Measures of
Rater Accuracy

Researchers in the area of performance appraisal have
identified a number of factors thought to influence performance
ratings. These are: (1) the appraisal instrument; (2) the rater;
(3) the ratee; and (4) the performance rating context. Most of the
work, thus far, has focused on either the rating instrument or the
rater. With respect to the instrument, the major concerns have
been increasing the reliability and validity of ratings and
reducing rating errors such as halo and leniency (e.g., Latham &
Wexley, 1977; Smith & Kendall, 1963). Similarly, the primary
outcome of interest when studying rater characteristics has been
the reduction of rating errors (Borman, 1979; Cascio & Valenzi,
1977; Taft, 1955).

Recently, it has been argued that the primary goal of
performance appraisals should be to obtain ratings that reflect, to
the extent possible, the actual behavior of the ratee (Bernardin &
Pence, 1980; Borman, 1978). The appropriate criterion for
performance appraisal from this perspective is rating accuracy.
Given this, research should focus on identifying factors that
reduce the ability of raters to make accurate ratings.
Unfortunately, while many potential inhibitors of accuracy in
performance evaluations have been suggested (Feldman & Hilterman,
1977; Terborg & Ilgen, 1975), empirical research is lacking.

The present research addresses rater accuracy by focusing on
one very prominent feature of the rating stimulus: the ratee him- or
herself. Most of the research on the role of the ratee in
performance appraisal has focused on qualities of the ratee that
are relatively unchanging over time, such as the ratee's race or
gender, and examined their impact on the quality of performance
ratings (Hamner, Kim, Baird, & Bigoness, 1974). This approach to
studying the ratee is important because static ratee
characteristics are likely to be used by raters as cues in forming
an initial impression.

In contrast to static information about ratees are dynamic
cues which change over time and, thus, result in impression
modification. One such dynamic quality is ratee job performance.
Although ratee job performance is the criterion against which to
judge the validity of ratings, the nature of that performance
information may introduce errors into ratings by influencing the
way in which the information is processed. There are at least two
characteristics of ratee performance that may act to reduce the
accuracy of performance appraisals: level (good vs. poor
performer) and consistency over time (consistent vs. inconsistent
performer). Both characteristics have been addressed in previous
research (e.g., DeNisi & Stevens, 1981; Gordon, 1970; Scott &
Hamner, 1975) but they have typically been related to such
dependent variables as attributions of causality, allocation of
organizational rewards and ratings of motivation and ability rather
than to rating accuracy. The present study will examine the effect
of ratee performance level and performance consistency on several
different measures of rating accuracy.

Ratee  Performance  Level 

Those studies which have compared the impact of known ratee
performance along with other variables have found that performance
is the best predictor of performance ratings (Bigoness, 1976;
Hamner et al., 1974; Leventhal, Perry & Abrami, 1977), but that it
accounts for little more than 30% of the variance in ratings
(Hamner et al., 1974). Cues other than the actual performance
behavior obviously influence the ratings. This suggests that
nonperformance-related information can potentially account for a
large proportion of the variance in performance ratings and, thus,
reduce their validity. In fact, Gordon (1970) found that the level
of performance, itself, influenced sensitivity to
performance-relevant behaviors. He found that, on the average, raters
correctly identified 88% of the desirable behaviors exhibited by
ratees but only 73% of the undesirable ones and labeled this effect
the Differential Accuracy Phenomenon (DAP). One purpose of the
present study was to test the DAP.

Ratee Performance Consistency

the consistency of a ratee's performance (Borman, 1983); (2)
developing performance appraisal systems that take into account
ratee performance inconsistency (Kane & Lawler, 1979); and (3)
determining the impact of performance inconsistency on ratings.

Studies in the latter area are most similar to the present
research. The few studies conducted (e.g., DeNisi & Stevens, 1981;
Scott & Hamner, 1975) suggest that performance consistency does
influence performance ratings. In general, variable performance
tends to result in more negative ratings. For example, although
variable performers were rated as having greater ability, they were
given lower ratings of motivation when compared to consistent
performers (Scott & Hamner, 1975). Similarly, DeNisi and Stevens
(1981) found that among low performers, variable performers
received more negative ratings on a composite variable (consisting
of ratings of ability and motivation, allocation of organizational
rewards and performance attributions) than did stable performers.
Both studies found evidence for a recency effect with ascending
performance receiving more favorable ratings. Although these
studies indicate that ratee performance consistency affects
ratings, they do not provide any insight into the effect of
consistency on rating accuracy. The present study explicitly
examined the effect of performance consistency on several measures
of rating accuracy as well as on the rating process in general.

Hypotheses

Performance Consistency and Sampling. One relevant rating
process variable is the amount of information about ratee
performance that the rater samples. The cognitive processing view
of performance appraisal suggests that there will be an inverse
relationship between performance consistency and sampling.

According to this view, during a rater's initial exposure to
ratees, she or he attempts to form some kind of impression of
ratees. This involves placing them into categories that facilitate
making sense (Weick, 1979) of their behavior. For consistent
performers this categorization process should not be problematic
since the ratee clearly can be categorized or labeled as either a
good or a poor performer. Furthermore, subsequent behaviors of
consistent performers should match the initial categorization and
not be questioned. On the other hand, when a ratee's performance
is inconsistent, initial categorization becomes more difficult and
later observations will fail to fit the category, resulting in
controlled re-categorization processes (Feldman, 1981). Consistent
with this notion, research indicates that disconfirmed expectations
about a stimulus person trigger the perceiver's search for causal
information (Pyszczynski & Greenberg, 1981; Wong & Weiner, 1981).
This suggests our first two hypotheses:


H1: Perceived rating difficulty will be greater for inconsistent
performers than for consistent performers.

H2: More sampling of information about ratee performance will occur
for inconsistent performers than for consistent performers.

Performance Consistency and Accuracy. The relationship
between performance consistency and accuracy is more complex.
While it could be argued that the greater ease of categorizing and,
thus, evaluating consistent performers should result in greater
rating accuracy, it might also be argued that the greater amount of
information obtained from sampling for inconsistent performers
would result in greater rating accuracy (Favero & Ilgen, 1985;
Heneman & Wexley, 1983). The former suggests greater accuracy in
evaluating consistent performers while the latter suggests that
accuracy will be greater for inconsistent performers.

The two explanations just offered, which appear, at first
glance, to be antithetical, really are not because each assumes a
different conceptualization of rating accuracy. These different
conceptualizations of rating accuracy were described by Lord
(1985), who used signal detection theory to distinguish between
what he called classification accuracy and behavioral accuracy.
Classification accuracy is the apparent accuracy that results when
raters evaluate people based on general impressions. To the extent
that the ratee's actual behavior is consistent with the impression,
ratings made on the basis of this impression will appear to be
accurate (i.e., classification accuracy will be high). In
contrast, behavioral accuracy reflects the ability of raters to
identify specific behaviors exhibited by a ratee.

Conceptually these two types of accuracy are not necessarily
related to each other and may even covary negatively. For example,
classification accuracy is likely to be highest when a rater is
able to simplify the rating process by integrating ratee behavior
into a single cognitive category and then evaluating the ratee in
accordance with the categorization. In this situation, however,
behavioral accuracy is likely to be low since integrating ratee
behaviors into a cognitive category tends to result in specific
behaviors being forgotten. Other examples could be given where the
two forms covary positively. Thus, although behavioral accuracy is
what is typically meant when the term "accuracy" is used, most
common operational definitions of accuracy measure classification
accuracy. Specifically, accuracy as measured by the components of
accuracy identified by Cronbach (1955) only requires that raters be
able to form a general impression of ratees and then evaluate them
on the basis of this impression. The knowledge of actual ratee
behaviors required for behavioral accuracy is not necessary in
order to appear accurate.

When conceptualized as classification accuracy, the primary
requirement for accuracy is that raters be able to place ratees
into global categories. Classification accuracy should be higher
for consistent performers because of the greater ease with which an
impression can be formed about these ratees. In addition, sampling
may help raters to develop and stabilize an impression of ratees
and, therefore, increase classification accuracy. Thus, we
hypothesize that:

H3:  There  will  be  a  positive  relationship  between  the  amount  of 
sampling  and  classification  accuracy. 

H4:  Rating  accuracy  as  measured  by  classification-type  accuracy 
measures  will  be  greater  for  consistent  performers  than  for 
inconsistent  performers. 

The  reverse  is  hypothesized  for  the  relationship  between 
behavioral  accuracy  and  performance  consistency.  There  are  three 
reasons  for  this  prediction.  The  first  relates  to  our  hypothesis 
that  inconsistent  performance  will  lead  raters  to  sample  more 
performance  information.  Presumably,  those  who  observe  more
behavior  should  be  more  likely  to  recall  those  behaviors  observed 
and,  thus,  score  higher  on  behavioral  accuracy  measures.  Second, 
previous  research  implies  that  information  inconsistent  with 
expectations  is  stored  in  memory  in  a  unique  way  and,  thus,  is  more 
likely  to  be  recalled  (Graesser,  Gordon  &  Sawyer,  1979;  Graesser, 
Woll,  Kowalski  &  Smith,  1980;  Woll  &  Graesser,  1982).  Research  in 
the  leadership  area  is  consistent  with  this  notion  (e.g.,  Phillips 
&  Lord,  1982;  Phillips,  1984).  In  addition,  Hastie  (1980)  found 
that  memory  for  behavior  that  is  inconsistent  with  general 
impressions  is  greater  than  is  memory  for  behavior  consistent  with 
these  impressions. 

A third reason for the hypothesized greater behavioral
accuracy for inconsistent performers is that the greater difficulty
of categorizing behaviors observed from inconsistent ratees should
force raters to focus more on specific ratee behaviors when
evaluating their performance than on a general impression. This
should reduce the tendency for raters to forget ratee behaviors and
to attribute to ratees category-consistent behaviors which they did
not exhibit. The mechanisms just discussed suggest the following
hypotheses:

H5:  The  amount  of  sampling  of  behavioral  information  will  be 
positively  related  to  behavioral  accuracy. 

H6:  Raters  will  correctly  identify  more  behaviors  for  inconsistent 
performers  than  for  consistent  performers. 

H7:  Raters  will  be  more  likely  to  attribute  nonpresent  behaviors 
consistent  with  the  ratee's  category  to  consistent  performers 
than  to  inconsistent  performers.

H8:  Rating  accuracy  as  measured  using  behavioral  accuracy  will  be 
greater  for  inconsistent  performers  than  for  consistent 
performers. 

Performance  Level.  Based  on  the  differential  accuracy 
phenomenon  identified  by  Gordon  (1970),  raters  should  be  more 
accurate  evaluating  good  performers  than  poor  performers.  Since 
there  is  no  reason  to  expect  a  difference  between  accuracy  measures 
on  performance  level  it  is  hypothesized  that: 

H9: Classification and behavioral accuracy will be greater for good
performers than for poor performers.

Method

Overview 

Undergraduates were hired to play the role of an office
manager who supervised four secretaries. The four secretaries
worked as a team in order to complete work assignments for the
faculty members in the department. Four videotapes, one of each
secretary, allowed supervisors to observe each secretary at work.

In order to enhance the realism of the setting, several
conditions described by Ilgen and Favero (1985) as highly desirable
for performance appraisal research were incorporated into the
study. First, the participants observed the secretaries over time.
Second, participants had a variety of different tasks to perform,
only one of which was evaluating performance. Third, the job of
secretary in a department at a university was chosen because it was
familiar to the participants and, thus, the social categories used
to judge people in this position should be relatively accessible to
the subject population (Fiske & Kinder, 1981). Finally, multiple
ratees (four secretaries) were evaluated.

Subjects  and  Design 

A sample of 37 individuals (14 males and 23 females), ranging
in age from 17 to 38 years (mean = 22 years), participated in the
study. Sample size was based on a power analysis assuming a small
effect size and desiring power of .85 (Cohen & Cohen, 1983).
Participants were recruited through a newspaper advertisement
offering to pay approximately $18 for four hours of participation.

The design was a 2 x 2 analysis of variance with repeated
measures on both of the independent variables. The independent
variables were performance level (good or poor performer) and
performance consistency (consistent and inconsistent). Each
participant was exposed to four stimulus persons, each representing
a combination of the two independent variables (i.e., a
consistently good performer, a consistently poor performer, an
inconsistently good performer and an inconsistently poor
performer).

Development  of  Stimulus  Materials 

Construction of Videotapes. Four videotapes were developed,
one for each of four secretaries. Each tape contained 17 one- to
two-minute incidents for each secretary, with each incident
representing some level of performance on one or more of four
performance dimensions. The performance dimensions were: (1) job
knowledge and skill, (2) organizational ability, (3) dealing with
faculty/students, and (4) working cooperatively with other
secretaries. Each videotape depicted between 8 and 12 examples of
ratee job behavior for each of the four performance dimensions.

To develop behavioral incidents for the videotapes, five
secretaries were interviewed and asked to describe incidents of
good and poor secretarial performance. One hundred and one
incidents were collected from these interviews and were then
converted into short descriptions suitable for filming. These
incident descriptions were evaluated by six organizational behavior
graduate students on the basis of the dimensions judged to be
relevant to the incident and the level of performance represented
by the incident on each of those dimensions. These initial ratings
were used only to decide which incidents to film for each secretary
in order to create the desired manipulations. They were not used
in developing the performance standards. Four individuals with
secretarial experience were hired to play the roles of the four
secretaries in the videotapes. Volunteers filled other roles.

Development of Performance Standards. Ten persons working
full time as secretaries rated the videotaped incidents to
establish the performance standards against which to judge
accuracy. Several procedures were followed to enhance the ability
of the secretaries serving as expert raters to provide accurate
performance ratings. First, they were given an hour of training on
common rating errors (halo, leniency/strictness, central tendency,
first impression and contrast effect) and the use of the rating
scale. Next, they practiced rating and discussed their
ratings among themselves in order to establish a common frame of
reference for each performance dimension (Bernardin & Pence, 1980).
In addition, the secretaries were: (1) given a detailed written
description of each incident to read prior to each rating session,
(2) shown each incident twice before making their rating, and (3)
encouraged to take notes.

In making the actual ratings, the expert raters were first
asked to indicate which of the four performance dimensions were
represented in the incident. Next, they rated the level of
performance on the selected dimension(s) using seven-point BARS
developed specifically for clerical workers. In some cases, a
videotaped incident involved more than one of the four secretaries.
When this occurred, all relevant secretaries on the tape were
rated.

Using initial criteria of 70% agreement on the dimensions
represented by the videotaped incident and a standard deviation of
less than 1.25 on the level of the expert ratings, 45 incidents
were acceptable, three clearly unacceptable, and 35 were not
clearly either acceptable or unacceptable. The raters reconvened to
evaluate the 35 incidents for which evaluations were not clear.
The second rating of these incidents resulted in 21 which clearly
met the criteria and 14 which did so except for one outlier. It
was decided to include all 35 in the set of 80 from which the final
incidents were chosen as described below.
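
The two screening criteria can be illustrated with a short sketch. The function name, data layout, and example ratings are hypothetical and are not taken from the study. In particular, "agreement on the dimensions" is assumed here to mean the share of expert raters who selected a given dimension, and the sample standard deviation is assumed for the level ratings.

```python
from statistics import stdev


def meets_criteria(dimension_votes, level_ratings,
                   min_agreement=0.70, max_sd=1.25):
    """Check one incident against the two screening criteria.

    dimension_votes: one 0/1 flag per expert rater, 1 if the rater judged
        the dimension relevant to the incident (hypothetical layout).
    level_ratings: the raters' 1-7 performance ratings on that dimension.
    """
    # Share of raters who selected the dimension (assumed meaning of "agreement").
    agreement = sum(dimension_votes) / len(dimension_votes)
    # Sample standard deviation of the level ratings (an assumption).
    spread = stdev(level_ratings)
    return agreement >= min_agreement and spread < max_sd


# Invented example: 8 of 10 raters pick the dimension and rate it similarly.
votes = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
ratings = [5, 6, 5, 6, 5, 6, 5, 5]
print(meets_criteria(votes, ratings))
```

An incident that fails either test would fall into the unclear or unacceptable group; the sketch does not attempt to model how the study separated those two groups.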

In order to create the desired performance level and
performance consistency manipulations, the following criteria were
used in selecting specific incidents for each secretary from the
pool of 80 incidents: (1) minimize the difference in the average
performance level between consistency conditions and maximize this
difference between performance level conditions; (2) maximize the
variance in performance level across incidents within a dimension
for inconsistent performers and minimize this variance for
consistent performers; and (3) maximize the variance across
performance dimension mean scores for inconsistent performers and
minimize this variance for consistent performers. Table 1 presents
these data for the incidents that comprised the final four
videotapes.
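
The three quantities that these selection criteria trade off can be computed for one secretary as in the sketch below. The ratings are invented for illustration, the function and key names are hypothetical, and the use of population variance is an assumption; the study's own values are those reported in its Table 1.

```python
from statistics import mean, pvariance


def manipulation_stats(ratings_by_dimension):
    """Summarize one secretary's selected incidents.

    ratings_by_dimension: dict mapping a dimension name to the list of
        expert ratings for the incidents on that dimension (hypothetical).
    """
    groups = list(ratings_by_dimension.values())
    all_ratings = [r for group in groups for r in group]
    dimension_means = [mean(group) for group in groups]
    return {
        # Criterion 1: average performance level.
        "overall_mean": mean(all_ratings),
        # Criterion 2: variance across incidents within a dimension.
        "mean_within_dimension_variance": mean([pvariance(g) for g in groups]),
        # Criterion 3: variance across the dimension mean scores.
        "variance_of_dimension_means": pvariance(dimension_means),
    }


# Invented ratings for an inconsistent performer on two dimensions.
print(manipulation_stats({"job_knowledge": [7, 5], "organization": [4, 2]}))
```

Under these criteria, a consistent performer would show values near zero on the two variance terms, while an inconsistent performer with a similar overall mean would show larger values on both.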

Psychometric Properties of Expert Ratings. In order to be
confident that the performance standards developed for each
secretary on each performance dimension were indicative of the
secretary's actual performance, it was necessary to establish more
precisely the extent of agreement between the expert raters on the
ratings given to the secretaries. Agreement in two areas was
assessed: (1) agreement on the incidents determined to be relevant
to each performance dimension, and (2) agreement on the level of
performance represented in the incident on the relevant performance
dimensions.

The extent of agreement on the incidents judged to be relevant
to each dimension was assessed using coefficient alpha. Data were
coded "0" (the dimension was not judged to be relevant to the
incident) and "1" (the dimension was judged to be relevant to the
incident). Alpha coefficients were calculated for each dimension
both across incidents for each secretary and across all incidents
regardless of secretary. Since the alpha coefficients for each
secretary did not differ substantially from the overall alphas,
only the alpha coefficients based on all four secretaries are
reported. The coefficient alpha was .90 for Job Knowledge and
Skill, .96 for Dealing with Professors, .97 for Working
Cooperatively with Other Secretaries, and .91 for Organizational
Ability.
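
As a rough illustration of how coefficient alpha behaves on 0/1 relevance codes, the sketch below applies the standard alpha formula to a small invented data set. Treating the expert raters as the "items" and the incidents as the cases is an assumption about how the analysis was arranged, and the function name and data are hypothetical.

```python
from statistics import pvariance


def coefficient_alpha(codes):
    """Coefficient alpha for a table of 0/1 relevance codes.

    codes: list of rows, one per incident; each row holds one 0/1 code
        per rater for a single dimension (hypothetical layout).
    """
    k = len(codes[0])  # number of raters, treated as "items"
    # Variance of each rater's codes across incidents.
    rater_variances = [pvariance([row[j] for row in codes]) for j in range(k)]
    # Variance of the per-incident totals (number of raters coding 1).
    total_variance = pvariance([sum(row) for row in codes])
    return (k / (k - 1)) * (1 - sum(rater_variances) / total_variance)


# Invented codes: three raters, four incidents, one disagreement.
codes = [[1, 1, 1], [0, 0, 0], [1, 1, 0], [0, 0, 0]]
print(round(coefficient_alpha(codes), 3))
```

Alpha equals 1.0 when every rater gives identical codes and falls as raters disagree about which incidents are relevant to the dimension. The formula is undefined when the incident totals have no variance.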

Intraclass correlations were used to determine the degree of
agreement among the ten expert raters on the overall level of
performance of each secretary on each of the four performance
dimensions. These were .88 for Job Knowledge, .81 for Dealing with
Professors, .87 for Working Cooperatively with Other Secretaries,
.93 for Organization of Work, and .91 for the overall ratings.

Procedures

Each person participated in three separate sessions. The
first lasted 90 minutes, the second 60, and the final, two hours.
Performance ratings were made at the end of the first session and
during the final session. Sessions were separated by approximately
10 to 14 days with the first and last sessions 23 to 27 days apart.
All sessions were run in a small office on campus. The room was
furnished with a table and chair where participants worked on the
in-basket tasks. A bookcase served as a partition in the room and
behind it was a video recorder and monitor. The monitor was placed
so that it could not be seen from the table.

Session 1. When participants arrived for the first session,
they read and signed a consent form and filled out an initial
questionnaire designed to gather background information. After
completing this questionnaire, participants watched a 15-minute
videotaped set of instructions which described the study. They
were provided with a written copy of the information presented in
the videotape (labeled as the Management Department Office Manual)
to read as they listened to the videotape.

The study was described as one that dealt with how managers
balanced their time between competing tasks. Participants were
told that they would be playing the role of an office manager in
the Department of Management at a fictitious university. A brief
description of the organization, the job of office manager, and the
nature and number of subordinates associated with the position was
provided.

Most of the office manager's tasks were presented in an
in-basket. The in-basket included such things as filling out a
variety of university forms, updating a departmental account,
writing several memos, making changes in the course schedule book,
scheduling rooms and times for the classes of departmental faculty
and making a schedule for the visit of a prospective job candidate.
In all, 22 tasks were included in the in-basket.

In addition to working on the in-basket, participants were
told that they would be evaluating the performance of each of the
four secretaries at the end of the first and third sessions and
that they could "watch" the secretaries by viewing a portion of a
videotape on each secretary. The amount of time they spent viewing
the videotapes was up to them after the initial introduction. The
evaluation form and its use were then explained to participants.

Participants were informed that they would be evaluated on two
major criteria, the quality and quantity of the in-basket work
completed and the accuracy of their performance evaluations of each
secretary. As an incentive to perform well, a $25.00 reward was
offered for being the highest performer. This was awarded in
addition to the $18 paid to everyone for participating in the
study.

After completion of the videotaped instructions, the
experimenter explained in detail how to operate the video recorder.
Then each participant was shown the first five behavioral incidents
on the videotape for each secretary. The order in which
participants viewed the four secretary videotapes was
counterbalanced. Participants were permitted to take notes while
watching the videotapes. After participants had watched the
videotape of each secretary, the experimenter briefly reviewed the
instructions. Participants were then given 30 minutes to complete a
performance evaluation on each of the four secretaries and to begin
work on the in-basket tasks. Participants were not permitted to view
any additional videotape incidents prior to completing the rating
form.

Session 2. Session 2 was held approximately 12 days after the
first session. When participants arrived for the second session,
the instructions, rules, financial incentives, and operation of the
equipment were briefly reviewed. Participants were then shown a
behavioral incident viewed in the first session to be sure that
they could identify each secretary. During the session, the
experimenter checked on each participant two times to answer any
questions they might have and to let them know how much time
remained.

Session 3. Initial procedures for Session 3 were the same as
those for Session 2. Participants were given 50 minutes to work on
the in-basket tasks and to watch the videotapes of the secretaries.
After 50 minutes, participants were asked to complete a performance
evaluation for each of the secretaries and to fill out the final
questionnaires. When all questionnaires were completed,
participants were debriefed and paid for their participation.

Measures

Sampling. The amount of information search or sampling done
by each rater for each secretary was measured as the total number
of behavioral incidents watched for that secretary. At the end of
each session, the experimenter recorded the number of incidents
watched for each secretary. Prior to beginning the next session,
the videotapes were set at the point where that person had stopped
watching in the previous session.

Cronbach’s  Overall  Accuracy.  The  Cronbach  measure  of 
classification  accuracy  used  was  overall  accuracy.  This  Is 
measured  as  a  standard  sum  of  squared  deviations  of  the  rater's 
rating  on  a  given  dimension  from  the  true  rating  on  that  dimension. 


It was calculated as:

OA = Σ_i (R_i - S_i)^2

where R_i is the rating given to the ratee on dimension i and S_i is
the standard on dimension i.
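As an illustration of the verbal definition above, the following is a minimal Python sketch of a sum of squared deviations between a rater's ratings and the true scores. The function name, the argument names, and the example values are hypothetical, and any scaling or normalization applied in the original computation is not reproduced here.

```python
from typing import Sequence


def overall_accuracy(ratings: Sequence[float], standards: Sequence[float]) -> float:
    """Sum of squared deviations of ratings from the standards.

    One entry per rating dimension. Lower values indicate ratings that
    lie closer to the true scores. (Illustrative helper, not the
    study's own scoring program.)
    """
    if len(ratings) != len(standards):
        raise ValueError("ratings and standards must cover the same dimensions")
    return sum((r - s) ** 2 for r, s in zip(ratings, standards))


# Hypothetical rater: three dimensions, made-up ratings and true scores.
example = overall_accuracy([5, 6, 4], [6, 6, 2])
```

Because the index is a distance from the standard, a rater who matches the true score on every dimension receives a value of zero.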

Lord's  Accuracy  Measures.  The  information  to  calculate  Lord's 
(1985)  classification  and  behavioral  accuracy  measures  was  derived 
from  a  checklist  completed  by  participants  during  the  last  session. 
This  checklist  consisted  of  a  list  of  55  behaviors.  Each  of  the 
behaviors  listed  was  a  major  behavior  displayed  in  one  of  the 
incidents  on  the  videotape.  Sample  behavioral  statements  included: 
(1)  does  not  get  a  professor's  presentation  notes  and  overheads 
typed  on  time,  and  (2)  agrees  to  stay  after  regular  working  hours 
to finish typing an important paper for a professor in the
department.  Between  15  and  17  of  the  behaviors  were  exhibited  by 
each  secretary  (there  was  some  overlap  since  some  of  the  behaviors 
were  exhibited  by  more  than  one  secretary).  For  consistent 
performers,  all  of  the  behaviors  were  consistent  with  the  prototype 
of  either  a  good  or  poor  performing  secretary  while  for 
inconsistent  performers,  approximately  half  of  the  behaviors  on  the 
checklist  were  consistent  with  the  prototype. 

Participants  were  asked  to  read  each  statement  and  to  indicate 
which  secretary  (or  secretaries)  exhibited  that  behavior.  They 
were  also  told  to  indicate  which  of  the  behaviors  on  the  checklist 
they  did  not  observe.  The  latter  was  included  as  an  option  because 
the participants did not watch all of the incidents for each
secretary and, thus, would not have seen some of the behaviors on
the checklist.




From these responses, it was possible to calculate the
prototypical and nonprototypical hit rate and false alarm rate for
each of the four secretaries. These were used to determine
classification and behavioral accuracy. The prototypical hit rate
was the proportion of category-consistent items correctly
identified as having been exhibited by the ratee. The prototypical
false alarm rate was the proportion of category-consistent
nonpresent items (i.e., not exhibited by that secretary) falsely
recognized as having been exhibited by the ratee. The
nonprototypical hit rate was the proportion of category-inconsistent
items correctly attributed to the ratee. Finally, the
nonprototypical false alarm rate was the proportion of
category-inconsistent nonpresent items incorrectly identified as having been
exhibited by the ratee. Note that for consistent performers the
nonprototypical hit rate was zero, since, due to the nature of the
manipulation, these performers only exhibited behaviors consistent
with the category of either good or poor performer.

From these hit rates and false alarm rates, it was then
possible to calculate the classification and behavioral accuracy
measures. Classification accuracy was calculated as follows:

CA = (PHR + PFAR) - (NPHR + NPFAR)

where CA is classification accuracy, PHR is the prototypical hit
rate, PFAR is the prototypical false alarm rate, NPHR is the
nonprototypical hit rate and NPFAR is the nonprototypical false
alarm rate. High classification accuracy means that prototypical
behaviors were attributed to the ratees regardless of whether or not
they were actually exhibited by them. Behavioral accuracy was
calculated using the following formula:

BA = (PHR + NPHR) - (PFAR + NPFAR)

where BA is behavioral accuracy and the other terms are as defined
above. The greater the degree of behavioral accuracy the more able
raters are to identify the actual behaviors exhibited by the ratee
regardless of the prototypicality of the actual behavior.¹
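The two indices can be sketched in Python as follows. This is only an illustration of the definitions given above: the `Item` record, the function names, and the example checklist are hypothetical, and an empty item pool is scored as a rate of zero, which matches the zero nonprototypical hit rate described for consistent performers.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class Item:
    """One checklist behavior scored for one ratee (hypothetical layout)."""
    prototypical: bool  # consistent with the ratee's category (good or poor)
    present: bool       # actually exhibited by the ratee
    attributed: bool    # rater indicated that the ratee exhibited it


def _rate(items: List[Item], prototypical: bool, present: bool) -> float:
    """Proportion of items in the given pool that the rater attributed."""
    pool = [it for it in items
            if it.prototypical == prototypical and it.present == present]
    if not pool:
        return 0.0  # assumption: an empty pool contributes a rate of zero
    return sum(it.attributed for it in pool) / len(pool)


def lord_accuracy(items: Iterable[Item]) -> Dict[str, float]:
    """Hit rates, false alarm rates, and the CA and BA indices for one ratee."""
    items = list(items)
    phr = _rate(items, prototypical=True, present=True)
    pfar = _rate(items, prototypical=True, present=False)
    nphr = _rate(items, prototypical=False, present=True)
    npfar = _rate(items, prototypical=False, present=False)
    return {
        "PHR": phr, "PFAR": pfar, "NPHR": nphr, "NPFAR": npfar,
        "CA": (phr + pfar) - (nphr + npfar),
        "BA": (phr + nphr) - (pfar + npfar),
    }


# Made-up responses for one inconsistent ratee.
checklist = (
    [Item(True, True, True)] * 3 + [Item(True, True, False)] +       # PHR = .75
    [Item(True, False, True)] + [Item(True, False, False)] * 3 +     # PFAR = .25
    [Item(False, True, True)] + [Item(False, True, False)] +         # NPHR = .50
    [Item(False, False, False)] * 4                                  # NPFAR = .00
)
scores = lord_accuracy(checklist)
```

In this made-up case the rater earns CA = .50 and BA = 1.00, showing that the two indices can diverge for the same set of responses.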

Results 

Manipulation  Checks 

Two manipulation checks were carried out. First, the final
questionnaire included two questions assessing the perceived
consistency of each secretary's performance (average alpha = .68; a
separate alpha had to be computed for each secretary) and four
items assessing their perceived performance level (average alpha =
.76). Each of these scales was used as the dependent variable in a
repeated measures analysis of variance.
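For readers who wish to see how a scale reliability of this kind is obtained, the following is a minimal Python sketch of coefficient alpha. It uses the standard formula; the function name and the example scores are hypothetical and are not the study's data.

```python
from statistics import variance
from typing import Sequence


def cronbach_alpha(items: Sequence[Sequence[float]]) -> float:
    """Coefficient alpha for a scale.

    items[j][p] is respondent p's score on item j, so each inner
    sequence holds one item's scores across all respondents.
    """
    k = len(items)
    if k < 2:
        raise ValueError("alpha requires at least two items")
    # Total scale score for each respondent.
    totals = [sum(scores) for scores in zip(*items)]
    item_variance_sum = sum(variance(item) for item in items)
    return (k / (k - 1)) * (1 - item_variance_sum / variance(totals))


# Hypothetical two-item scale answered by four respondents.
alpha = cronbach_alpha([[1, 2, 3, 4], [2, 1, 4, 3]])
```

In a design like the present one, the function would be applied once per secretary and the resulting coefficients averaged.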

Results for performance level indicated a highly significant
performance level main effect (F(1,36) = 520.95, p < .01), with the
perceived performance level being higher for the high performers
than for the low performers (x̄ = 25.03 vs. x̄ = 11.20). When the
dependent variable was the perceived consistency of the secretary's
performance, a significant main effect for consistency was found
(F(1,36) = 75.65, p < .01). Examination of the mean differences
revealed that, as expected, the high consistency performers were
perceived as more consistent than the low consistency performers
(x̄ = 9.70 vs. x̄ = 7.30). However, there was also a significant
performance level effect on consistency (F(1,36) = 86.43,
p < .01) with high performers being perceived as more consistent than
low performers (x̄ = 9.93 vs. x̄ = 7.07). This effect may have been
due to the fact that consistency of performance was perceived
positively while inconsistency was perceived negatively and
participants tended to attribute positive characteristics to good
performers and negative characteristics to poor performers.

Post hoc questioning of each participant provided another
check on the manipulations. Ninety-eight percent of the
participants correctly identified the performance level of all four
secretaries and 77% correctly indicated the degree of consistency
for all four. For consistency, the remaining 23% of the
participants correctly identified the degree of consistency for two
of the four secretaries. There was no apparent pattern of
misidentification, suggesting that the misidentifications resulted
from individual differences in perceptions of the four secretaries
rather than from the ineffectiveness of the intended manipulations.
In combination, these two manipulation checks provide support for
the effectiveness of the performance level and performance
consistency manipulations.

Performance  Consistency 

Cell means and standard deviations for all the dependent
variables are reported in Table 2. Marginal means and the results
for the repeated measures analyses of variance used to test the
hypotheses are presented in Tables 3 and 4, and will be described
more completely below.

Sampling and Rating Difficulty. The first hypothesis stated
that raters would find it easier to evaluate consistent performers
than inconsistent ones. This hypothesis was tested with a repeated
measures analysis of variance using subjects' post-task rating of
the difficulty of evaluating each secretary as the dependent
variable. As predicted, performance consistency significantly
affected rating difficulty (F(1,36) = 43.75, p < .01) such that it
was less difficult to evaluate consistent performers than
inconsistent ones (see Table 3). It was also hypothesized that the
sampling of behavioral information would be greater for
inconsistent performers than for consistent performers (Hypothesis
2). This hypothesis was not supported (F(1,36) = 1.14, p > .05).
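The single-degree-of-freedom tests reported in this section can be illustrated with a short Python sketch. In a fully within-subjects 2 x 2 design, the F for a main effect equals the squared one-sample t computed on each rater's contrast score, which is why 37 raters yield 1 and 36 degrees of freedom. The function name, the cell ordering, and the example data below are assumptions made for illustration; the sketch does not reproduce the analysis program actually used.

```python
import math
from statistics import mean, stdev
from typing import Sequence, Tuple

# Assumed cell order for each rater:
# (consistent-good, consistent-poor, inconsistent-good, inconsistent-poor)
Cells = Tuple[float, float, float, float]


def consistency_main_effect(data: Sequence[Cells]) -> Tuple[float, Tuple[int, int]]:
    """F ratio and degrees of freedom for the consistency main effect."""
    # Contrast score: mean of the consistent cells minus mean of the inconsistent cells.
    contrasts = [(cg + cp) / 2 - (ig + ip) / 2 for cg, cp, ig, ip in data]
    n = len(contrasts)
    # One-sample t against zero; squaring it gives F(1, n - 1).
    # Note: stdev() of identical contrasts is zero, which would raise ZeroDivisionError.
    t = mean(contrasts) / (stdev(contrasts) / math.sqrt(n))
    return t * t, (1, n - 1)


# Three hypothetical raters with contrast scores of 1, 2, and 3.
f_ratio, df = consistency_main_effect([(1, 1, 0, 0), (2, 2, 0, 0), (3, 3, 0, 0)])
```

The performance level main effect and the interaction follow the same pattern with different contrast weights.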

Classification Accuracy Measures. The third hypothesis, which
predicted that sampling of information would be positively
correlated with classification accuracy, received little support
for the Cronbach measure of classification accuracy (overall
accuracy). None of the correlations between sampling for a
particular secretary and the Cronbach classification accuracy
measure were significant (the average correlation based on an r to
z transformation was -.05). Somewhat different results were found
for the Lord measure of classification accuracy. When correlations
were computed by collapsing over performance levels within
consistency conditions, performance consistency appeared to
moderate the relationship between sampling and accuracy such that
the correlation was nonsignificant for the consistent ratees
(r(73) = -.11, p > .05) but significant and positive for the
inconsistent ratees (r(73) = .30, p < .01).
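The averaging of correlations through an r to z transformation mentioned above can be sketched in Python as follows. The function name and the example values are hypothetical, and the sketch gives each correlation equal weight, which is an assumption that seems reasonable here because every correlation is based on the same raters.

```python
import math
from typing import Sequence


def average_correlation(rs: Sequence[float]) -> float:
    """Average correlations via Fisher's r to z transformation.

    Each r is converted to z = atanh(r), the z values are averaged,
    and the mean is converted back with tanh. Values of exactly +1 or
    -1 are not allowed because atanh is undefined there.
    """
    zs = [math.atanh(r) for r in rs]
    return math.tanh(sum(zs) / len(zs))


# Made-up correlations for four secretaries.
r_bar = average_correlation([0.10, -0.20, 0.05, -0.15])
```

Averaging in the z metric avoids the slight bias that results from averaging raw correlations, particularly when the values are large in absolute size.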

The fourth hypothesis was that classification accuracy (using
both the Cronbach and Lord measures) would be greater for
consistent than inconsistent performing ratees. Cronbach's measure
of overall accuracy as well as Lord's measure of classification
accuracy were used. A repeated measures analysis of variance was
done for both of these variables, with very similar results being
obtained for each (see Table 4). Specifically, results revealed a
significant main effect of performance consistency on both
Cronbach's overall accuracy (F(1,36) = 44.06, p < .01) and Lord's
measure of classification accuracy (F(1,36) = 108.75, p < .01).
In both cases there was greater accuracy in evaluating consistent
performers. No significant interactions were found for either
variable.

Behavioral  Accuracy  Measure.  Hypothesis  5  stated  that  the 
amount  of  sampling  would  be  positively  related  to  behavioral 
accuracy.  There  was  little  support  for  this  hypothesis.  A 
separate  correlation  was  computed  between  the  number  of  incidents 
watched  for  each  secretary  and  the  degree  of  behavioral  accuracy  in 
evaluating  that  secretary.  The  average  correlation  (based  on  an  r 
to z transformation) was nonsignificant (r = -.09). In addition,
there was no relationship between the variables when the data were
collapsed  over  all  secretaries  (r  (73)  >  *'•27,  p  >  .05). 

The  sixth  and  seventh  hypotheses  were  that  raters  would 
correctly  identify  more  behaviors  for  inconsistent  performers  than 
for  consistent  performers  (measured  as  the  hit  rate)  and  that  they 
would  be  more  likely  to  attribute  nonpresent  prototypical  behaviors 
to  consistent  performers  (measured  as  the  prototypical  false  alarm 
rate).  Results  of  a  repeated  measures  analysis  of  variance  using 
the  hit  rate  as  the  dependent  variable  found  a  significant 
performance consistency main effect (F(1,36) = 13.23, p < .01)
with means in the opposite direction of that hypothesized (see
Table 4). Specifically, the hit rate was found to be greater for
consistent performers than for inconsistent performers. However, a
similar analysis of the prototypical false alarm rate found support
for the hypothesis. The performance consistency main effect was
significant (F(1,36) = 20.64, p < .01) with the false alarm rate
being  greater  for  consistent  performers. 

The  eighth  hypothesis  was  that  behavioral  accuracy  would  be 
greater  for  inconsistent  performers  than  for  consistent  performers. 
This  was  tested  with  a  repeated  measures  analysis  of  variance  using 
Lord's  behavioral  accuracy  as  the  dependent  variable.  Results  for 
the  relationship  between  consistency  and  behavioral  accuracy  were 
somewhat  supportive  of  the  hypothesis  that  behavioral  accuracy 
would  be  greater  for  inconsistent  performers.  Although  no 
significant main effects for behavioral accuracy were found, a
significant performance level by performance consistency
interaction was obtained (F(1,36) = 9.34, p < .01). Examination
of the cell means indicated that behavioral accuracy was highest
for the consistently good performer and lowest for the consistently
poor performer, while behavioral accuracy for both inconsistent
performers was in between (see Table 4). Analyses of simple main
effects within performance levels revealed that, among good
performers, behavioral accuracy was significantly greater for the
consistent performer than the inconsistent performer (F(1,36) =
9.13, p < .01). Among poor performers, the effect was not
significant (F(1,36) = 2.18, p > .05).

Although  the  finding  that  behavioral  accuracy  was  greatest  for 
the  consistently  good  performer  was  contrary  to  prediction,  it  must 
be  tempered  by  the  results  of  an  additional  analysis.  Because  no 
participants  watched  all  of  the  incidents  available  for  any  of  the 
secretaries,  some  of  the  behaviors  on  the  behavioral  checklist  were 
not  observed  by  each  participant.  This  resulted  in  the  possibility 
that  participants  could  attribute  behaviors  which  they  did  not 
actually  observe  to  one  of  the  secretaries.  Clearly,  this  is  a 
reduction in behavioral accuracy since it indicates that raters
incorrectly identified the behaviors exhibited by the secretaries.
Since this could not be incorporated into Lord's determination of
behavioral accuracy, it was analyzed separately by counting, for
each participant, the number of prototypical behaviors which they
incorrectly attributed to each secretary. This was then used as
the dependent variable in a repeated measures analysis of variance.
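The counting procedure just described can be sketched in Python as follows. The data layout (a mapping from behavior identifiers to prototypicality, plus sets of observed and attributed behaviors) and all names are hypothetical; they simply make the counting rule explicit.

```python
from typing import Dict, Set


def count_unobserved_attributions(
    prototypical: Dict[str, bool],
    observed: Set[str],
    attributed: Set[str],
) -> int:
    """Count prototypical behaviors attributed to a secretary without being observed.

    prototypical: behavior id -> True if the behavior fits this secretary's category
    observed:     behavior ids the rater actually watched for this secretary
    attributed:   behavior ids the rater credited to this secretary on the checklist
    """
    return sum(
        1
        for behavior in attributed
        if prototypical.get(behavior, False) and behavior not in observed
    )


# Made-up checklist responses for one rater and one secretary.
n_errors = count_unobserved_attributions(
    prototypical={"b1": True, "b2": True, "b3": False, "b4": True},
    observed={"b1", "b3"},
    attributed={"b1", "b2", "b3", "b4"},
)
```

One such count per rater and secretary then serves as the dependent variable.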

Results revealed a significant performance consistency main
effect (F(1,36) = 13.32, p < .01) with significantly more
unobserved prototypical behaviors being attributed to consistent
performers than to inconsistent performers. This finding suggests
that the actual level of behavioral accuracy for consistent
performers is lower than it appears to be and, thus, provides some
support for the hypothesis that behavioral accuracy may be greater
for inconsistent performers.

Performance  Level 

The last hypothesis was that classification and behavioral
accuracy would be greater for good performers than for poor
performers. Contrary to this hypothesis, the performance level
main effect for Cronbach's overall accuracy measure was found to
be nonsignificant (F(1,36) = 1.92, p = .17). Using Lord's measure
of classification accuracy, it was found that accuracy tended to be
greater for good performers than for poor performers, but the
results were only marginally significant (F(1,36) = 3.10,
p = .08). As reported earlier, there was a significant interaction for
behavioral accuracy (F(1,36) = 9.34, p < .01) such that good
performers were rated more accurately than poor ones. However,
this only occurred for consistent performers (see Table 4).

Discussion

Previous research has identified several ratee characteristics
(e.g., race and sex) that can influence the accuracy of the
performance evaluations. The present study examined two potential
dynamic sources of inaccuracy in performance appraisals, ratee
performance level and performance consistency, and found that
performance consistency influenced the way in which raters
processed performance information.

Researchers studying cognitive processes have suggested two
possible ways in which raters may process performance information
(e.g., Nathan & Lord, 1983; Murphy, Carmen, Martin & Garcia, 1982).
In one approach, raters integrate behavioral information into
general categories or impressions of people. In this case,
behavioral specifics tend to be forgotten and general impressions
become the basis for subsequent evaluations. In the second
approach, it is thought that either actual behavioral observations
are recalled or behavioral observations are integrated into
behavioral dimensions which are recalled. In either case, the
focus is on a more behaviorally-oriented approach to processing and
storing information, which should then facilitate accuracy in
rating.

The results of this study suggest that one characteristic of
ratee performance, its consistency, may influence the approach used
to process information. Specifically, for consistent performers,
raters were more likely to integrate specific observed behaviors
into general cognitive categories and then use these categories in
making evaluations. For inconsistent performers, the more
behavioral approach to processing and storing information appears
to be used. Several findings from this study combine to support
this conclusion: (1) classification accuracy was found to be
greater for consistent performers; and (2) raters were more likely
to incorrectly attribute both observed and unobserved prototypical
behaviors to consistent performers. These findings indicate that
raters tended to attribute prototypical behaviors to consistent
performing ratees regardless of whether or not they actually
exhibited them. Furthermore, raters processed observed behaviors
of consistent ratees by integrating them into a general impression
which was used as the basis for making performance evaluations.

Performance  consistency  and  the  approaches  to  processing 
information  also  affected  both  measures  of  rating  accuracy — 
behavioral  and  classification  accuracy.  Behavioral  accuracy,  the 
correct  identification  of  actual  behaviors  which  were  or  were  not 
observed,  tended  to  be  greater  when  evaluating  inconsistent  ratees. 
This  may  have  been  due  to  the  apparent  differences  between 
consistent  and  inconsistent  ratees  in  how  information  was 
processed.  Apparently,  raters  processed  and  stored  specific  ratee 
behaviors  for  inconsistent  performing  ratees  but  only  a  general 
impression  for  consistent  ratees.  The  former  approach  to 
processing information should result in greater behavioral accuracy
while the latter should lead to enhanced classification accuracy,
as was observed.

The  finding  that  the  greatest  degree  of  behavioral  accuracy 
occurred  for  the  consistently  good  performer  was  contrary  to  our 
hypothesis.  At  first  glance,  it  appears  to  contradict  the  notion 
that for consistent performers, raters process performance
information  using  general  categories.  Instead,  it  suggests  that 
raters  stored  and  recalled  specific  behaviors.  However,  since 
consistent  performers  did  not  exhibit  any  nonprototypical 
behaviors,  relying  on  a  general  impression  of  these  ratees  would 
actually  increase  the  rater's  probability  of  correctly  attributing 
prototypical  behaviors  to  these  ratees,  leading  to  the  higher  hit 
rate  which  we  observed  and,  other  things  being  equal,  increasing 
behavioral  accuracy.  This  would  also  explain  the  significant 
positive  correlation  observed  between  classification  and  behavioral 
accuracy for the consistent performers (r(74) = .33, p < .01). On
the other hand, for inconsistent performers this correlation was
negative (r(74) = -.29, p < .01) since, in this case, a general
impression  would  tend  to  reduce  behavioral  accuracy  by  increasing 
the  likelihood  that  raters  would  forget  nonprototypical  behaviors. 

Several  of  the  hypotheses  of  this  research  were  based  on  the 
premise  that  the  ratee's  performance  behavior  would  influence  the 
time  spent  observing  the  ratee  and,  in  turn,  observation  time  would 
affect accuracy. None of the hypotheses involving observation time
were supported. There are a number of reasons why observation time
may not have functioned as expected. The most plausible to us is
that the repeated measures nature of the design may have led
participants to believe that they ought to sample all ratees
approximately equally. Such a belief would have removed
differences among ratees and, thus, the hypothesized differences in
observation time.

This study also found some support for the Differential
Accuracy Phenomenon identified by Gordon (1970) since, among
consistent performers, behavioral accuracy was greater for the good
performer than for the poor one. However, there was little
difference between good and poor performers who were inconsistent.
Raters appeared to find it easier to evaluate good performers,
perhaps because they had a better understanding of what constituted
good performance than poor performance. Specifically, when someone
exhibited an undesirable behavior it may have been difficult for
raters to determine how ineffective that behavior was unless it was
extremely ineffective. Good behavior, on the other hand, may have
been less ambiguous and, therefore, easier to identify. The reason
the effect was only found for consistent performers is not clear.

Implications

The results of this study have several implications. First,
the distinction between classification and behavioral accuracy is
important, as is the empirical demonstration of the differences in
the way these accuracy measures function. Consistent with Lord
(1985), we have argued that although accuracy, in a conceptual
sense, really implies behavioral accuracy, common
operationalizations of accuracy (such as Cronbach's) only tap
classification accuracy. We also agree with Lord (1985) that
researchers in the area of performance appraisal looking at rating
accuracy, particularly those studying cognitive processes, should
begin to use a more behavioral approach to assessing accuracy.
Although this is a more rigorous accuracy criterion than that
typically assessed, we believe it is more conceptually correct and,
thus, avoids the problem of raters appearing to be accurate with
Cronbach's measures of classification accuracy when, in fact, in a
true behavioral sense, they are not.

While  a  behavioral  accuracy  criterion  also  has  practical 
relevance  for  such  functions  as  providing  developmental  feedback, 
in  other  situations,  classification  accuracy  may  be  all  that  is 
required  of  raters.  For  example,  when  raters  have  to  select  a 
subordinate  to  receive  an  award  or  determine  which  subordinates 
should be given the largest or smallest pay increases, all that is
necessary  is  that  raters  be  able  to  assess,  in  a  general  way,  the 
overall  performance  level  of  the  ratee.  Given  this,  we  suggest 
that  classification  accuracy  (assessed  by  either  the  Cronbach  or 
Lord  measures)  can,  perhaps,  best  be  seen  as  a  necessary  but  not 
sufficient  condition  for  true  behavioral  accuracy. 

The data from this study also suggest the need to identify
factors that might increase the tendency of raters to rely on a
general impression in evaluating performance rather than on
specific ratee behaviors since this tends to reduce behavioral
accuracy. The present study suggested one such factor, the
consistency of a ratee's performance, but other factors, such as
the race and sex of the ratee, might also result in a similar
tendency. This could account for the occurrence of inaccuracy in
ratings. From a practical point of view, this study supports the
suggestions of previous researchers (e.g., Bernardin & Pence, 1980)
on the importance of training raters to focus on ratee behaviors,
using such things as behavioral diaries.

The study also suggests that raters need to learn the
distinction between prototypical and nonprototypical behaviors and
be aware of the common tendency for false positive prototypical
behavior errors. For example, when observing multiple ratees, the
rater may recall observing a particular behavior but attribute that
behavior to the ratee for which the behavior is most prototypical.
One  potential  outcome  of  this  tendency  would  be  for  the  ratings  of 
consistent  performers  to  be  more  extreme.  In  other  words,  good 
performers  would  generally  be  rated  higher  than  they  deserve  while 
poor  performers  would  tend  to  be  rated  lower  than  they  ought  to  be. 
If  raters  can  be  taught  to  eliminate  this  type  of  error,  perhaps 
through  procedures  such  as  reality  monitoring  (Johnson  &  Raye, 
1981), the accuracy of evaluations might be increased. Since some
evidence suggests that observational accuracy is positively related
to rating accuracy (Murphy, Garcia, Kerkar, Martin & Balzer, 1982)

Footnote

¹A number of other variables not directly related to the major
hypotheses of this study were assessed on the final questionnaire
but were not reported here.